Data exploration is not only about creating numbers and summary statistics. Sometimes a beautiful plot reveals more exciting insights into data. In this exercise, we exploit what we’ve just learned about plots in R and in particular in ggplot2. Now we’re going to use all of the gapminder GDP data!
Since we use all years this time, you no longer need the filter()-function for choosing the individual time periods.
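# dplyr provides %>%, rename(), filter(), arrange(), group_by(), summarise();
# tidyr provides gather()
library(dplyr)
library(tidyr)

gapminder_ggplot_input <-
  readxl::read_excel(
    path = "../data/gapminder/GDPpercapitaconstant2000US.xlsx",
    sheet = "Data"
  ) %>%
  # the first column holds the country names
  rename(country = `Income per person (fixed 2000 US$)`) %>%
  # reshape from one column per year to one row per country and year
  gather(-country, key = "year", value = "GDP") %>%
  filter(!is.na(GDP)) %>%
  arrange(year, GDP) %>%
  # average GDP per capita over all countries, per year
  group_by(year) %>%
  summarise(GDP_over_all_countries = mean(GDP))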
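To see what the reshaping and averaging steps do without needing the Excel file, here is a minimal sketch on a made-up table. The tibble `toy_wide` and its values are purely hypothetical; only the pipeline steps mirror the code above.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
toy_wide <- tibble(
  country = c("A", "B", "C"),
  `1960` = c(100, 200, NA),
  `1961` = c(110, 210, 310)
)

toy_yearly <- toy_wide %>%
  gather(-country, key = "year", value = "GDP") %>%  # wide -> long
  filter(!is.na(GDP)) %>%                            # drop missing values
  group_by(year) %>%
  summarise(GDP_over_all_countries = mean(GDP))      # one mean per year

toy_yearly
```

Note that `year` ends up as a character column, because it is built from the former column names. That is why the x-axis is treated as discrete in the plots that follow.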
Previously, we only analyzed how the period of 1960-1969 compares to the period of 2002-2011. The nice thing about plots is that we can make use of the whole range of years and still identify differences between various periods. Our plot of choice, therefore, is a line plot that shows the data as a time series.
Instead of geom_point as in the slides, the geom you need here is geom_line. Moreover, in the aesthetics definition aes() you may want to define a grouping variable group = 1; otherwise, ggplot thinks you want to plot one line for each year.
library(ggplot2)

ggplot(
  data = gapminder_ggplot_input,
  aes(x = year, y = GDP_over_all_countries, group = 1)
) +
  geom_line()
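The effect of group = 1 can be checked directly. The following is a small sketch on hypothetical toy data (the data frame `toy` and its values are made up); it uses ggplot2's layer_data() to inspect which group each point is assigned to, with and without the grouping aesthetic.

```r
library(ggplot2)

# Hypothetical toy data; year is stored as text, as in the reshaped GDP data
toy <- data.frame(
  year = c("1960", "1961", "1962"),
  GDP_over_all_countries = c(150, 210, 260),
  stringsAsFactors = FALSE
)

p_default <- ggplot(toy, aes(x = year, y = GDP_over_all_countries)) +
  geom_line()
p_grouped <- ggplot(toy, aes(x = year, y = GDP_over_all_countries, group = 1)) +
  geom_line()

# layer_data() returns the data ggplot2 computed for a layer, including group ids
groups_default <- unique(layer_data(p_default)$group)
groups_grouped <- unique(layer_data(p_grouped)$group)

length(groups_default)  # one group per year, so no line can be drawn
length(groups_grouped)  # a single group, so all points are connected
```

Because a line needs at least two points per group, the default grouping by the discrete year variable leaves nothing to connect, while a constant group puts every year on the same line.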
Admittedly, this may not be the best approach to identify differences between the periods directly: the plot does not show where our periods start and end. Luckily, this can be fixed using at least two approaches. Let’s start with the first one: using colors for different periods. For this purpose, we need an indicator variable that serves as a grouping variable and lets us color the line differently for each period.
mutate() combined with if_else() lets you create new variables rather easily. Moreover, to get sensible legend labels later, define the period values as strings.
gapminder_ggplot_input <-
  gapminder_ggplot_input %>%
  mutate(
    period =
      if_else(
        year >= 1960 & year <= 1969,
        "1960-1969",
        if_else(
          year >= 2002 & year <= 2011,
          "2002-2011",
          "1970-2001"
        )
      )
  )
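As an alternative to nesting if_else(), the same indicator can be written with dplyr's case_when(). This is only a sketch on a hypothetical tibble (`toy_years` and the helper column `year_num` are made up for illustration), and it converts the year to an integer first so that the comparisons are numeric rather than text-based.

```r
library(dplyr)

# Hypothetical years, stored as text like the year column after reshaping
toy_years <- tibble(year = c("1960", "1969", "1985", "2002", "2011"))

toy_years <- toy_years %>%
  mutate(
    year_num = as.integer(year),  # helper column for numeric comparisons
    period = case_when(
      year_num >= 1960 & year_num <= 1969 ~ "1960-1969",
      year_num >= 2002 & year_num <= 2011 ~ "2002-2011",
      TRUE ~ "1970-2001"          # everything else falls into the middle period
    )
  )

toy_years$period
```

The conditions are checked from top to bottom, so each year receives the label of the first condition it satisfies, which mirrors the logic of the nested if_else() calls.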
After we’re set up with our indicator variable, it’s plotting time again. We can simply re-use our code from before and define a grouping color in the aesthetics definition. Try it out!
In aes(), you can set the option color = indicator_variable to define the grouping.
ggplot(
  data = gapminder_ggplot_input,
  aes(
    x = year,
    y = GDP_over_all_countries,
    color = period,
    group = 1
  )
) +
  geom_line()
Now we can see some visual differences between the different periods. One last issue, however, is that there are way too many labels on the x-axis. A more sensible labeling approach is to create axis breaks in ten-year steps.
Use scale_x_discrete() and set its breaks with the option breaks = breaks_vector.
ggplot(
  data = gapminder_ggplot_input,
  aes(
    x = year,
    y = GDP_over_all_countries,
    color = period,
    group = 1
  )
) +
  geom_line() +
  scale_x_discrete(
    breaks = seq(
      from = 1960,
      to = 2011,
      by = 10
    )
  )
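To check which labels the breaks produce, the breaks vector can be built and inspected on its own. This is a small sketch in base R; the name `breaks_vector` follows the hint above, and the conversion to character is an optional step added here so the breaks have the same type as the text-valued years on the discrete axis.

```r
# Every tenth year, starting at 1960 and not exceeding 2011
breaks_vector <- as.character(seq(from = 1960, to = 2011, by = 10))

breaks_vector
# "1960" "1970" "1980" "1990" "2000" "2010"
```

The sequence stops at 2010 because the next step, 2020, would exceed the upper limit of 2011.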